Food & Agriculture
GTPBD: AFine-Grained Global Terraced Parcel and Boundary Dataset
Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision agriculture. In this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manually annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world. Compared to the existing datasets, the GTPBD dataset brings considerable challenges due to the: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction and unsupervised domain adaptation (UDA) tasks.
Preference Learning with Lie Detectors can Induce Honesty or Evasion
As AI systems become more capable, deceptive behaviors can undermine evaluation and mislead users at deployment. Recent work has shown that lie detectors can accurately classify deceptive behavior, but they are not typically used in the training pipeline due to concerns around contamination and objective hacking. We examine these concerns by incorporating a lie detector into the labelling step of LLM post-training and evaluating whether the learned policy is genuinely more honest, or instead learns to fool the lie detector while remaining deceptive. Using DolusChat, a novel 65k-example dataset with paired truthful/deceptive responses, we identify three key factors that determine the honesty of learned policies: amount of exploration during preference learning, lie detector accuracy, and KL regularization strength. We find that preference learning with lie detectors and GRPO can lead to policies which evade lie detectors, with deception rates of over 85%. However, if the lie detector true positive rate (TPR) or KL regularization is sufficiently high, GRPO learns honest policies. In contrast, off-policy algorithms (DPO) consistently lead to deception rates under 25% for realistic TPRs. Our results illustrate a more complex picture than previously assumed: depending on the context, lie-detector-enhanced training can be a powerful tool for scalable oversight, or a counterproductive method encouraging undetectable misalignment.
SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks
Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 47 foundation models and has collected over 20,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building modelbased automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on collected preference data. It measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
CPO: Condition Preference Optimization for Controllable Image Generation
To enhance controllability in text-to-image generation, ControlNet introduces image-based control signals, while ControlNet++ improves pixel-level cycle consistency between generated images and the input control signal. To avoid the prohibitive cost of back-propagating through the sampling process, ControlNet++ optimizes only low-noise timesteps (e.g., t < 200) using a single-step approximation, which not only ignores the contribution of high-noise timesteps but also introduces additional approximation errors. A straightforward alternative for optimizing controllability across all timesteps is Direct Preference Optimization (DPO), a fine-tuning method that increases model preference for more controllable images (Iw) over less controllable ones (Il). However, due to uncertainty in generative models, it is difficult to ensure that win-lose image pairs differ only in controllability while keeping other factors, such as image quality, fixed. To address this, we propose performing preference learning over control conditions rather than generated images.
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
Large language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. To address this gap, we introduce INFINITY-CHAT, a largescale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., creative content generation, brainstorm & ideation) that further breaks down to 17 subcategories.
Supplementary Materials AGMMU: AComprehensive Agricultural Multimodal Understanding Benchmark Aruna Gauba1,2,5 Irene Pi1,3,5 Yunze Man1,4,5 Ziqi Pang1,4,5 Vikram S. Adve1,4,5 Yu-Xiong Wang1,4,5
Our evaluation and analysis are conducted mainly on the group of models listed in Table 2 in the13 main paper. We have chosen models such that they cover most of the popular and best-performing14 methods used by recent multimodal understanding work. In this part, we discuss all the models we15 have used in our experiments and explain their evaluation details, the public checkpoints we have16 chosen, and display the prompts we used to adapt the model to our datasets.17 During evaluation, we chose to follow the standard prompt provided by the authors whenever possi-18 ble for multiple-choice and short-answer questions. When the prompt is not provided for the model,19 we select a custom prompt that is created through several iterations of prompt engineering to select20 the one that produces the most effective results. The images are always included as the prefix.21 We used three proprietary models in our evaluation: GPT-o4-mini [1], Gem-22 ini 1.5 Pro [9], and Claude 3 Haiku [10]. Below we note the model API version used for evaluation.23 GPT-o4-mini: May 13-15, 2025.24 Cambrian-1 is a recent state-of-the-art model that excels at visual-centric tasks.27 This model explores combinations of vision encoders, text and image integration techniques, and28 instruction tuning strategies. We use the official implementation and checkpoint1 with a LLaMA3-29 8B-Instruct LLM backbone model in our evaluation.30 InternVL scales up the vision foundation model while aligning it with the back-31 bone LLM, and is trained on web-scale image-text data to achieve strong performance across a vari-32 ety of vision-centric tasks. We use the official implementation and checkpoint2 with the InternViT-33 300M-448px vision backbone and Internlm2.5-7B-chat LLaMA-3.2 is the first collection of multimodal large language model from the35 LLaMA family that was previously text-only. The integration of vision involves utilizing cross-36 attention layers and a pre-trained vision encoder that feeds directly into the text-processor. The37 model follows a commonly used training recipe that includes pretraining on noisy image-text pairs38 and then high-quality knowledge enhanced pairs. Notably, the language-model parameters were39 frozen during the training of alignment of image and text to retain strong text-only capabilities. We40 use the official implementation and checkpoint3 that uses a LLaMA-3.1 text-only language backbone41 in our evaluation. When evaluating the model, we choose to use a custom prompt since no standard42 prompt is provided.43
AGMMU: AComprehensive Agricultural Multimodal Understanding Benchmark
Unlike prior datasets that rely on crowdsourced prompts, AGMMU is distilled from 116,231 authentic dialogues between everyday growers and USDAauthorized Cooperative Extension experts. Through a three-stage pipeline: automated knowledge extraction, QA generation, and human verification, we construct (i) AGMMU, an evaluation set of 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), and (ii) AGBASE, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. AGMMU has three key advantages: Authentic & Expert-Verified: All facts, images, and answers originate from real farmer and gardener inquiries answered by credentialed specialists, ensuring high-fidelity agricultural knowledge. Complete Development Suite: AGMMU uniquely couples a dual-format evaluation benchmark (MCQ and OEQ) with AGBASE, a large-scale training set, enabling both rigorous assessment and targeted improvement of VLMs. Knowledge-intensive Challenge: Our tasks demand the synergy of nuanced visual perception and domain expertise, exposing fundamental limitations of current general-purpose models and charting a path toward robust, application-ready agricultural AI. Benchmarking 12 leading VLMs reveals pronounced gaps in fine-grained perception and factual grounding. Open-sourced models trail after proprietary ones by a wide margin. Simple fine-tuning on AGBASE boosts open-sourced model performance on challenging OEQs for up to 11.6% on average, narrowing this gap and also motivating future research to propose better strategies in knowledge extraction and distillation from AGBASE. We hope AGMMU stimulates research on domain-specific knowledge integration and trustworthy decision support in agriculture AI development.
Young humpback whale freed from fishing line near Cape Cod
The whale sustained some injuries during the ordeal, but should recover. More information Adding us as a Preferred Source in Google by using this link indicates that you would like to see more of our content in Google News results. View of the whale after being freed. Note the red wounds from its most recent entanglement near the tail and the deep, but healing wound, near its head from a prior entanglement. Center for Coastal Studies image, taken under NOAA permit 24359.
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 14,871 high-resolution (1.0m) multi-temporal images spanning 42 major cities in the U.S. from 2005 to 2023, organized into two components: DVLBench and DVL-Instruct. The DVL-Bench includes six urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regionallevel) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 18 state-of-the-art MLLMs and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.